Example for dimensionnality reduction



In [1]:

    
import pandas as pd
import numpy as np



In [2]:

    
my_data = pd.DataFrame([1,2,3])

convert integer in to binary string



In [8]:

    
def to_binary(value):
    return "{0:b}".format(value)



In [10]:

    
to_binary(5)









    Out[10]:





'101'



In [12]:

    
unique_values = my_data.thrid.unique()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-8db6822080d0> in <module>()
----> 1 unique_values = my_data.thrid.unique()

/Library/Python/2.7/site-packages/pandas-0.18.0-py2.7-macosx-10.11-intel.egg/pandas/core/generic.pyc in __getattr__(self, name)
   2667             if name in self._info_axis:
   2668                 return self[name]
-> 2669             return object.__getattribute__(self, name)
   2670 
   2671     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'thrid'

we apply this to a data we use enumerate to loop then save a dictionary to encapsulate of the values that are unique in binary code, so it give you the index possition and the value, now what we do is that we create a dictionary that maps this individual string in to binary.



In [14]:

    
my_dict = {}
for index,val in enumerate(unique_vals):
    my_dict[val] = to_binary(index)









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-60ee4666a676> in <module>()
      1 my_dict = {}
----> 2 for index,val in enumerate(unique_vals):
      3     my_dict[val] = to_binary(index)
      4 

NameError: name 'unique_vals' is not defined



In [ ]:

    
my_data["thrid_binary"] = my_data.apply(lambda x: my_dict[x.thrid], axis = 1)

then you will want to split my data binary in to columns and so you split the string in to nothing. you can put a separator in between them and then split on that separator. You have to look at the histogram of the column and see if the top n account for the majority. if they do, you can use those ones, then you'd have a partition for those columns.
It would still be a better model.



In [ ]: